(54) DOCUMENT CLASSIFYING DEVICE

(57)Abstract: 

PURPOSE: To use semantic differences to automatically 
classify a document by automatically extracting feature 
vectors from the document and classifying the document 
based on these feature vectors. 

CONSTITUTION: The device comprises a storage part 101 where document data is stored; a
document analysis part 102 which analyzes document data; a word vector generating part 103
which uses co-occurrence relations between words in the document to automatically generate a
feature vector expressing the features of each word; a word vector storage part 104 where the
word feature vectors are stored; a document vector generating part 105 which generates the
feature vector of the document from the feature vectors of the words included in the document;
a document vector storage part 106 where the document feature vectors are stored; a classifying
part 107 which uses the similarity between document feature vectors to classify the document;
a result storage part 108 where the classification result is stored; and a feature vector
generating dictionary 109 where the words to be used for feature vector generation are
registered.
CLAIMS



[Claim(s)] 

[Claim 1] Document classification equipment comprising: a storage section which stores document
data; a document analysis section which analyzes the document data; a word vector generation
section which automatically generates a feature vector expressing the features of each word
using the co-occurrence relations between the words in a document; a word vector storage section
which stores those feature vectors; a document vector generation section which generates the
feature vector of a document from the feature vectors of the words contained in the document; a
document vector storage section which stores that feature vector; a classification section which
classifies documents using the similarity between the feature vectors of documents; a result
storage section which stores the classification result; and a dictionary for feature-vector
generation in which the words used when generating feature vectors are registered; the equipment
being characterized in that it can automatically generate the feature vector expressing the
features of each word using the co-occurrence relations between the words in a large number of
document files, and can classify documents automatically.

[Claim 2] Document classification equipment which, in addition to the configuration of the
document classification equipment of claim 1, has a useful-word selection section which selects
useful words using the classification result stored in the result storage section at the time of
classification, and which is characterized in that, after a large number of document files have
been classified, it investigates the frequency of occurrence of each word in each resulting
category, selects the words useful for classification, and uses only those useful words for
classification, whereby the precision of classification can be raised.

[Claim 3] Document classification equipment which, in addition to the configuration of the
document classification equipment of claim 1 or claim 2, has a representative vector generation
section which obtains a feature vector representing each category using the classification
result stored in the result storage section, and a representative vector storage section which
stores the representative vectors generated by the representative vector generation section, and
which is characterized in that, after a large number of document files have been classified, it
can obtain the feature vector representing each field using the feature vectors of the words and
documents in each resulting category.
DETAILED DESCRIPTION



[Detailed Description of the Invention] 
[0001] 

[Industrial Application] This invention relates to document classification equipment used in
systems that automatically classify documents or store and automatically retrieve similar
documents, such as word processors and filing systems.
[0002] 

[Description of the Prior Art] Conventionally, automatic classification of documents has been
difficult: either the user classified documents manually, or keywords were extracted from the
document and classification was performed using a thesaurus created beforehand. Moreover, even
in systems described as automatic classification, the basic data for classification, such as
basic examples, had to be entered manually.
[0003] 

[Problem(s) to be Solved by the Invention] However, with such classification the manual work
becomes a bottleneck, so classifying a large number of documents is very difficult.
[0004] This invention was made in consideration of the above situation, and aims to provide
document classification equipment which classifies documents automatically without manual work.
[0005] 

[Means for Solving the Problem] The invention concerning claim 1 is document classification
equipment having a storage section which stores document data; a document analysis section which
analyzes the document data; a word vector generation section which automatically generates a
feature vector expressing the features of each word using the co-occurrence relations between
the words in a document; a word vector storage section which stores those feature vectors; a
document vector generation section which generates the feature vector of a document from the
feature vectors of the words contained in the document; a document vector storage section which
stores that feature vector; a classification section which classifies documents using the
similarity between the feature vectors of documents; a result storage section which stores the
classification result; and a dictionary for feature-vector generation in which the words used
when generating feature vectors are registered. It is characterized in that the feature vector
expressing the features of each word is generated automatically using the co-occurrence
relations between the words in a large number of document files, and documents can be classified
automatically.

[0006] Moreover, in addition to the above-mentioned configuration, the invention concerning
claim 2 is further equipped with a useful-word selection section which selects useful words
using the classification result stored in the result storage section at the time of
classification. It is characterized in that, after a large number of document files have been
classified, the frequency of occurrence of each word is investigated for each resulting
category, the words useful for classification are selected, and only those useful words are used
for classification, whereby the precision of classification can be raised.

[0007] Moreover, in addition to the above-mentioned configuration, the invention concerning
claim 3 is further equipped with a representative vector generation section which obtains a
feature vector representing each category using the classification result stored in the result
storage section, and a representative vector storage section which stores the representative
vectors generated by the representative vector generation section. It is characterized in that,
after a large number of document files have been classified, the feature vector representing
each field can be obtained using the feature vectors of the words and documents in each
resulting category.
[0008] 

[Function] The operation of claim 1 when learning the feature vectors of words is explained. The
contents of a large number of document files stored in the document storage section are passed
to the document analysis section, where the text is analyzed (morphological analysis etc.). The
word vector generation section then analyzes the co-occurrence relations, frequencies of
occurrence, etc. of the words and generates the feature vector of each word. The feature vectors
generated in this way are stored in the word vector storage section. This is how the feature
vectors of words are learned. Restricting the words for which feature vectors are generated to
those registered in the dictionary for feature-vector generation prevents the storage space for
feature vectors from becoming too large.

[0009] The operation of claim 1 when classifying a document is explained. When a document is
classified, the contents of the document file stored in the document storage section are passed
to the document analysis section, where the text is analyzed (morphological analysis etc.). The
document vector generation section looks up, in the word vector storage section, the feature
vectors of the words found by the document analysis section, and generates the feature vector of
the document from the feature vectors of the words contained in the document. The feature vector
of the document generated in this way is stored in the document vector storage section, and the
classification section classifies the document according to the similarity between document
feature vectors. The classification result is stored in the result storage section.
[0010] With the configuration of claim 2, after a large number of documents have been
classified, the useful-word selection section selects useful words using the classification
result stored in the result storage section at the time of classification. Only the words
selected by the useful-word selection section are registered in the dictionary for
feature-vector generation, the feature vectors of words are learned again, and the documents are
classified again using the resulting word feature vectors. In this way the storage space for
feature vectors can be made smaller than with the configuration of claim 1, and the precision of
classification can also be raised.

[0011] With the configuration of claim 3, after a large number of documents have been
classified, the representative vector generation section obtains the feature vector representing
each category using the classification result stored in the result storage section. The
representative vectors generated by the representative vector generation section are stored in
the representative vector storage section. Once the representative vector of each category has
been generated, the category to which a new document belongs can be judged simply by comparing
the feature vector of that document with the representative vector of each category.
[0012] 

[Example] Hereafter, preferred examples of this invention are explained in detail with reference
to the drawings.

[0013] One example of the invention concerning claim 1 is shown in drawing 1. The document
classification equipment consists of a storage section 101 which stores document data; a
document analysis section 102 which analyzes the document data; a word vector generation section
103 which automatically generates a feature vector expressing the features of each word using
the co-occurrence relations between the words in a document; a word vector storage section 104
which stores those feature vectors; a document vector generation section 105 which generates the
feature vector of a document from the feature vectors of the words contained in the document; a
document vector storage section 106 which stores that feature vector; a classification section
107 which classifies documents using the similarity between the feature vectors of documents; a
result storage section 108 which stores the classification result; and a dictionary 109 for
feature-vector generation in which the words used when generating feature vectors are
registered.

[0014] Since a very large number of words are in general use in ordinary documents, it is more
realistic to restrict the words used when creating feature vectors. The dictionary 109 for
feature-vector generation serves this purpose: feature vectors of words are created using only
the words registered in it, which keeps the storage space for feature vectors from growing huge.

[0015] Drawing 2 shows the system configuration when learning the feature vectors of words. At
learning time, a large amount of document data for learning is stored in the document storage
section 101. The document data read from the document storage section 101 is passed to the
document analysis section 102 in suitable units, such as one report, one paragraph, or one
sentence; the document analysis section 102 analyzes the data and extracts the words. Based on
the word sequence extracted here, the word vector generation section 103 generates the feature
vectors of the words, and the generated feature vectors are stored in the word vector storage
section 104. In this way, the feature vectors of words are learned.

[0016] Drawing 3 shows the system configuration when classifying documents. When documents are
classified, the data of the documents to be classified is stored in the document storage section
101. The document data read from the document storage section 101 is passed to the document
analysis section 102 one classification unit at a time (for example, one report at a time); the
document analysis section 102 analyzes the data and extracts the words. The feature vectors of
the words extracted here are looked up in the word vector storage section 104. Usually two or
more words are extracted from one unit (for example, one report) of document data, and the
feature vector of the document is calculated by averaging the feature vectors of all the words
found. A better value may be obtained if, instead of a simple average, each feature vector is
weighted before averaging: for example, by the reciprocal of the word's frequency of occurrence,
or by counting the number of reports in which the word appears in a large set of reports and
multiplying the word's feature vector by log(total number of reports / number of reports in
which the word appears).
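The log-weighted averaging described in paragraph [0016] can be sketched as follows. This is only an illustration under stated assumptions: the function names, the four toy reports, and the two-dimensional word feature vectors are hypothetical and do not come from the specification.

```python
import math


def report_weights(reports):
    """Return log(total reports / reports containing the word) for every word."""
    total = len(reports)
    containing = {}
    for words in reports:
        for word in set(words):
            containing[word] = containing.get(word, 0) + 1
    return {word: math.log(total / n) for word, n in containing.items()}


def weighted_document_vector(words, word_vectors, weights):
    """Average the word feature vectors after multiplying each by its weight."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    used = 0
    for word in words:
        if word not in word_vectors:
            continue  # no learned feature vector for this word
        weight = weights.get(word, 0.0)
        total = [t + weight * v for t, v in zip(total, word_vectors[word])]
        used += 1
    return [t / used for t in total] if used else total


# Hypothetical toy data: four reports and two-dimensional word feature vectors.
reports = [["report", "tax", "budget"], ["report", "tax"],
           ["report", "match"], ["report", "goal", "match"]]
word_vectors = {"report": [1.0, 1.0], "tax": [1.0, 0.0], "budget": [1.0, 0.0],
                "match": [0.0, 1.0], "goal": [0.0, 1.0]}

weights = report_weights(reports)
doc = weighted_document_vector(reports[0], word_vectors, weights)
print(weights)
print(doc)
```

A word that appears in every report receives weight log(1) = 0 and therefore contributes nothing to the document vector, while rarer words contribute more.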
[0017] Once the feature vectors of documents have been obtained, documents can be classified by
applying conventional clustering techniques. Documents whose feature vectors are close to one
another can simply be regarded as belonging to the same field.

[0018] Alternatively, a person may choose a typical document for each category, a provisional
representative vector of the category is generated from the feature vectors of the words
extracted from that document, and a document read from the document storage section 101 is
classified according to which category's provisional representative vector its feature vector is
closest to. With this technique too, if a large amount of document data is read from the
document storage section 101, the effect of errors in the person's choice of provisional
representative vectors decreases, and a fairly general representative vector for each field can
eventually be generated.
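One possible reading of paragraph [0018] is sketched below: each document is assigned to the nearest provisional representative vector, and each representative vector is then replaced by the average of the documents assigned to it. The averaging update is an assumption (the text does not state how the provisional vectors become more general), and the function names and two-dimensional toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Inner product of u and v after normalizing both to magnitude 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def refine(provisional, documents, rounds=3):
    """Assign documents to the nearest representative, then re-average (assumed update)."""
    reps = {name: list(vector) for name, vector in provisional.items()}
    for _ in range(rounds):
        members = {name: [] for name in reps}
        for doc in documents:
            nearest = max(reps, key=lambda name: cosine(doc, reps[name]))
            members[nearest].append(doc)
        for name, docs in members.items():
            if docs:  # keep the old vector if no document was assigned
                reps[name] = [sum(column) / len(docs) for column in zip(*docs)]
    return reps


# Hypothetical provisional vectors (deliberately imprecise) and document vectors.
provisional = {"x": [1.0, 0.5], "y": [0.2, 1.0]}
documents = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

refined = refine(provisional, documents)
print(refined)
```

In this toy run the refined vectors move away from the hand-picked starting points toward the averages of the documents actually assigned to each category (taxon).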

[0019] Next, the method of generating the feature vector of a word is explained concretely. The
feature vector of a word is obtained by taking the frequency-of-occurrence distribution of the
words contained in a unit of document data, multiplying it by the number of times the word
occurs in that unit, and adding the result to the word's feature vector. A concrete example
follows.
[0020] Example A: "The American government has proposed a radical reexamination of the COCOM
regulation to the advanced major powers."

Example B: "The countries subject to regulation appear inclined to sharply reduce COCOM's
regulated items, on condition that export of industrial products which lead to the manufacture
of arms is regulated."

How the feature vectors of words are created from this document data is explained below. Here
the document data is read in units of one sentence, but other units, such as one report, may
also be used.

[0021] The feature vector is assumed to have 21 dimensions (21 words are registered in the
dictionary for feature-vector generation), with the elements corresponding, in order, to the
words "United States, government, advanced, main, country, COCOM, regulation, radical,
reexamination, proposal, object, arms, manufacture, industry, product, export, conditions,
items, large, reduction, intention".
[0022] Under these conditions, when Example A is read from the document storage section 101, it
is analyzed by the document analysis section 102 and "United States, government, advanced, main,
country, COCOM, regulation, radical, reexamination, proposal" are extracted. The word vector
generation section 103 then adds 1 to the elements corresponding to these words in the feature
vectors of all these words. That is, (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0) is added to the feature vector of each word that appears in Example A, such as "United
States" and "government". This is illustrated in drawing 8.

[0023] Next, when Example B is read from the document storage section 101, it is analyzed by the
document analysis section 102, and "regulation, object, country, arms, manufacture, industry,
product, export, regulation, conditions, COCOM, regulation, items, large, reduction, intention"
are extracted.

[0024] The word frequency-of-occurrence distribution obtained from this is (0, 0, 0, 0, 1, 1, 3,
0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1). Since "regulation" appears three times, this
distribution multiplied by three, that is (0, 0, 0, 0, 3, 3, 9, 0, 0, 0, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3), is added to the feature vector of "regulation". (0, 0, 0, 0, 1, 1, 3, 0, 0, 0, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1) is added to the feature vector of each word that appears only once in
Example B, such as "object" and "country". This is illustrated in drawing 9.
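The accumulation described in paragraphs [0019] to [0024] can be sketched as follows, as a minimal illustration rather than the actual implementation. The function name and the English labels of the 21 dictionary words (in particular "reexamination" for the ninth word) are assumptions; the word lists are the ones extracted from Examples A and B.

```python
from collections import Counter

# Assumed English labels for the 21 dictionary words of paragraph [0021], in order.
DICTIONARY = [
    "United States", "government", "advanced", "main", "country", "COCOM",
    "regulation", "radical", "reexamination", "proposal", "object", "arms",
    "manufacture", "industry", "product", "export", "conditions", "items",
    "large", "reduction", "intention",
]


def learn_word_vectors(units):
    """Accumulate word feature vectors from lists of words extracted per unit."""
    vectors = {word: [0] * len(DICTIONARY) for word in DICTIONARY}
    for words in units:
        counts = Counter(w for w in words if w in vectors)
        # Frequency-of-occurrence distribution of dictionary words in this unit.
        distribution = [counts[word] for word in DICTIONARY]
        for word, n in counts.items():
            # Add the distribution multiplied by the word's own count in the unit.
            vectors[word] = [v + n * d for v, d in zip(vectors[word], distribution)]
    return vectors


example_a = ["United States", "government", "advanced", "main", "country",
             "COCOM", "regulation", "radical", "reexamination", "proposal"]
example_b = ["regulation", "object", "country", "arms", "manufacture",
             "industry", "product", "export", "regulation", "conditions",
             "COCOM", "regulation", "items", "large", "reduction", "intention"]

word_vectors = learn_word_vectors([example_a, example_b])
print(word_vectors["arms"])
print(word_vectors["regulation"])
```

Words seen only in Example B, such as "arms", end up with the Example B distribution, while "regulation" accumulates the Example A distribution plus three times the Example B distribution.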
[0025] Note that with this approach of always adding integers, the magnitude of the added vector
changes with the length of the sentence. Alternatives are therefore conceivable, such as
normalizing the magnitude of the vector to be added to 1 before adding it, or normalizing the
magnitude of the frequency-of-occurrence distribution vector to 1 and multiplying it by a value
proportional to the number of occurrences before adding it.

[0026] The feature vector finally obtained is normalized so that its magnitude is 1.
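The second normalization option of paragraph [0025] can be sketched as below for the word "regulation" in Example B. The helper name `unit` is hypothetical, and using the occurrence count itself as the "value proportional to the number of occurrences" is an assumption.

```python
import math


def unit(vector):
    """Scale a vector so that its magnitude (Euclidean norm) is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


# Frequency-of-occurrence distribution of Example B (paragraph [0024]).
distribution = [0, 0, 0, 0, 1, 1, 3, 0, 0, 0] + [1] * 11

# Normalize the distribution to magnitude 1, then scale it by the number of
# occurrences of the word ("regulation" occurs 3 times in Example B).
increment = [3 * x for x in unit(distribution)]

magnitude = math.sqrt(sum(x * x for x in increment))
print(magnitude)
```

With this variant the size of each increment depends only on how often the word occurs, not on how long the sentence is.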
[0027] The feature vectors of words obtained in this way are stored in the word vector storage
section 104 and are used when documents are classified.
[0028] Next, the processing for generating the feature vector of a document at classification
time is explained, taking as an example the case where the following Example C is read.

[0029] Example C: "The American government proposed a reduction of arms."

When Example C is read from the document storage section 101, it is analyzed by the document
analysis section 102, and "United States, government, arms, reduction, proposal" are extracted.
The document vector generation section 105 then refers to the contents of the word vector
storage section 104 and adds together the feature vectors of the words that appear in Example C,
such as "United States" and "government", obtaining (3, 3, 3, 3, 5, 5, 9, 3, 3, 3, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2) as the feature vector of Example C. This is illustrated in drawing 10. For
ease of understanding, the vectors in drawing 10 are not normalized, but in actual processing
the feature vector of each word is normalized to magnitude 1 before being added. The feature
vector obtained is stored in the document vector storage section 106.
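The summation in paragraph [0029] can be reproduced with the short sketch below. The function names and the `normalize` flag are hypothetical; the word feature vectors are the unnormalized ones that result from Examples A and B, so the output matches the unnormalized illustration of drawing 10.

```python
import math


def unit(vector):
    """Scale a vector so that its magnitude is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def document_vector(words, word_vectors, normalize=False):
    """Add up the feature vectors of the words extracted from a document."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in words:
        vector = word_vectors.get(word)
        if vector is None:
            continue  # word is not registered in the dictionary
        if normalize:
            vector = unit(vector)  # as done in actual processing per the text
        total = [t + v for t, v in zip(total, vector)]
    return total


seen_in_a = [1] * 10 + [0] * 11                        # words seen only in Example A
seen_in_b = [0, 0, 0, 0, 1, 1, 3, 0, 0, 0] + [1] * 11  # words seen only in Example B
word_vectors = {"United States": seen_in_a, "government": seen_in_a,
                "proposal": seen_in_a, "arms": seen_in_b, "reduction": seen_in_b}

example_c = ["United States", "government", "arms", "reduction", "proposal"]
doc_c = document_vector(example_c, word_vectors)
print(doc_c)
```

Passing `normalize=True` gives the variant in which each word vector is scaled to magnitude 1 before being added.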

[0030] Next, how the classification section 107 uses the feature vectors of documents at
classification time is explained. In the simplest case, the feature vector of the document is
first normalized to magnitude 1, and the document is then classified either with an existing
technique such as the K-means method, or according to its similarity (obtained by finding a
distance or calculating an inner product) to the (provisional) representative vector of each
category. However, feature vectors obtained by this technique have the property that the
elements corresponding to frequently occurring words become very large, and a better
classification result is often obtained by taking measures so that this property does not
adversely affect classification. For example, when finding a distance, instead of the usual
square root of the sum of squared differences between elements, it is better to use a distance
that does not magnify the differences between elements, such as the sum of the absolute values
of the differences between elements. When finding an inner product, it is better to flatten the
values first, for example by taking the logarithm or a root of every element, and then to
normalize and calculate.
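The measures suggested in paragraph [0030] can be sketched as follows. The function names and toy vectors are hypothetical, and using log(1 + x) rather than log(x) is an assumption made so that zero elements stay at zero; the text only says to take the logarithm or a root.

```python
import math


def unit(vector):
    """Scale a vector so that its magnitude is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def euclidean(u, v):
    """Usual distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of absolute differences, which does not magnify large differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine(u, v):
    """Inner product after normalizing both vectors to magnitude 1."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def flattened_cosine(u, v, mode="log"):
    """Flatten every element (log(1 + x) or square root), then normalize and compare."""
    flatten = math.log1p if mode == "log" else math.sqrt
    return cosine([flatten(x) for x in u], [flatten(x) for x in v])


# Toy vectors: one element of u is inflated by a frequently occurring word.
u, v = [9, 1, 1], [1, 1, 1]
print(cosine(u, v), flattened_cosine(u, v), flattened_cosine(u, v, mode="root"))
```

On the toy vectors, flattening reduces the influence of the single inflated element, so the two vectors are judged more similar than with the plain inner product.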

[0031] As an example of classification, suppose that there are three categories and that the
representative vector of each category has been obtained as follows.

[0032] The representative vector of category 1: (1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1)
The representative vector of category 2: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5)
The representative vector of category 3: (4, 4, 4, 4, 6, 6, 6, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1)
Suppose that the feature vector of the document and the representative vector of each category
are normalized to magnitude 1, their inner product is calculated, and the category giving the
largest value is taken to have the highest similarity. The normalized vectors are then as
follows.
The feature vector of Example C [0033]
[Equation 1]
\[ \frac{1}{\sqrt{238}} \, (3, 3, 3, 3, 5, 5, 9, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2) \]
[0034] The representative vector of category 1 [0035]
[Equation 2]
\[ \frac{1}{\sqrt{8}} \, (1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1) \]
[0036] The representative vector of category 2 [0037]
[Equation 3]
\[ \frac{1}{\sqrt{285}} \, (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5) \]
[0038] The representative vector of category 3 [0039]
[Equation 4]
\[ \frac{1}{\sqrt{210}} \, (4, 4, 4, 4, 6, 6, 6, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) \]
[0040] Therefore, the inner products of the feature vector of Example C with the representative
vector of each category are as follows.
The inner product with category 1 [0041]
[Equation 5]
\[ \frac{1}{\sqrt{238}} \cdot \frac{1}{\sqrt{8}} \cdot 20 = 0.4583 \]
[0042] The inner product with category 2 [0043]
[Equation 6]
\[ \frac{1}{\sqrt{238}} \cdot \frac{1}{\sqrt{285}} \cdot 150 = 0.5759 \]
[0044] The inner product with category 3 [0045]
[Equation 7]
\[ \frac{1}{\sqrt{238}} \cdot \frac{1}{\sqrt{210}} \cdot 211 = 0.9438 \]

[0046] Thus the feature vector of Example C turns out to be closest to the representative vector
of category 3, so Example C is classified into category 3. This is illustrated in drawing 11. As
in drawing 10, the vectors in drawing 11 are not normalized for ease of understanding, but in
actual processing each vector is normalized to magnitude 1 before being compared. The
classification result is stored in the result storage section 108.
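The comparison worked through in paragraphs [0031] to [0046] can be checked with the following sketch. The function and variable names are hypothetical; the vectors are the ones given in the text for Example C and the three categories (taxa).

```python
import math


def cosine(u, v):
    """Inner product of u and v after normalizing both to magnitude 1."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


doc_c = [3, 3, 3, 3, 5, 5, 9, 3, 3, 3] + [2] * 11
representatives = {
    1: [1, 1, 1, 1] + [0] * 13 + [1, 1, 1, 1],
    2: [1] * 10 + [5] * 11,
    3: [4, 4, 4, 4, 6, 6, 6, 3, 3, 3] + [1] * 11,
}

scores = {name: cosine(doc_c, rep) for name, rep in representatives.items()}
best = max(scores, key=scores.get)
for name, score in scores.items():
    print(name, round(score, 4))
print("classified into category", best)
```

The three scores agree with the values 0.4583, 0.5759, and 0.9438 stated in the text, and the largest one selects category 3.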

[0047] Next, one example of claim 2 of this invention is shown in drawing 4. Here, the parts
denoted by reference numerals 201-209 are the same as those denoted by 101-109 in drawing 1,
respectively.

[0048] The document classification equipment consists of a storage section 201 which stores
document data; a document analysis section 202 which analyzes the document data; a word vector
generation section 203 which automatically generates a feature vector expressing the features of
each word using the co-occurrence relations between the words in a document; a word vector
storage section 204 which stores those feature vectors; a document vector generation section 205
which generates the feature vector of a document from the feature vectors of the words contained
in the document; a document vector storage section 206 which stores that feature vector; a
classification section 207 which classifies documents using the similarity between the feature
vectors of documents; a result storage section 208 which stores the classification result; a
dictionary 209 for feature-vector generation in which the words used when generating feature
vectors are registered; and a useful-word selection section 210 which selects useful words using
the classification result stored in the result storage section 208 at the time of
classification.
[0049] Drawing 5 shows the system configuration at learning and classification time. First, the
feature vectors of words are learned and a large amount of document data is classified on that
basis, by the same approach as in the example of claim 1. The classification result is stored in
the result storage section 208, and based on this result the useful-word selection section 210
selects useful words. To do this, the frequency of each word is obtained for every category, and
either the words contained at about the same rate in every category are removed (approach 1:
remove only the words for which the ratio of the highest frequency to the lowest frequency is at
or below a threshold), or the words contained at a high rate only in a certain category are
selected (approach 2: select the words for which the ratio of the highest frequency to the
second-highest frequency is at or above a threshold). The words considered by the useful-word
selection section 210 need not be limited to the words registered in the dictionary 209 for
feature-vector generation; selection can be made from a wider range of words.
[0050] As an example, suppose that there are three categories, a, b, and c, and that three
words, "politics", "Japan", and "international", are registered in the dictionary 209 for
feature-vector generation. Suppose also that the frequency of each word in every category
(frequencies being investigated for "election" and "problem" as well, in addition to the words
registered in the dictionary 209 for feature-vector generation) was as follows.

[0051]
Category a: politics 30%, Japan 5%, international 35%, election 10%, problem 20%
Category b: politics 3%, Japan 55%, international 35%, election 2%, problem 5%
Category c: politics 3%, Japan 30%, international 35%, election 2%, problem 30%
In this case, if approach 1 is used, "international" is contained at the same rate in every
category and is therefore removed from the dictionary for feature-vector generation. Since
"politics", "Japan", "election", and "problem" are biased in frequency across categories, they
are selected as useful words and registered in the dictionary 209 for feature-vector generation
(to limit the number of registered words, only the required number need be taken, in descending
order of total frequency of occurrence, from among the words with biased frequency). If approach
2 is used, "politics" and "election" are selected and registered in the dictionary 209 for
feature-vector generation, while "Japan", "international", and "problem" are not registered. As
an intermediate between approach 1 and approach 2, useful words may also be selected according
to whether the ratio of the highest frequency to the n-th highest frequency is at or above a
threshold (n being at least 3 and at most the number of categories minus one). Selecting words
with a large variance of frequency, rather than a large frequency ratio, is also conceivable.
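Approaches 1 and 2 of paragraphs [0049] to [0051] can be sketched as follows, using the frequencies of the example. The function names and the threshold values 2.0 and 3.0 are hypothetical (the text does not state any threshold); they are chosen only so that the sketch reproduces the selections described in the text.

```python
import math

# Frequency (%) of each word in categories a, b, c, as given in paragraph [0051].
FREQUENCIES = {
    "politics":      {"a": 30, "b": 3,  "c": 3},
    "Japan":         {"a": 5,  "b": 55, "c": 30},
    "international": {"a": 35, "b": 35, "c": 35},
    "election":      {"a": 10, "b": 2,  "c": 2},
    "problem":       {"a": 20, "b": 5,  "c": 30},
}


def ratio(high, low):
    """Frequency ratio; a zero denominator counts as an unbounded ratio."""
    return high / low if low else math.inf


def approach_1(frequencies, threshold):
    """Remove words whose highest/lowest frequency ratio is at or below the threshold."""
    return [word for word, f in frequencies.items()
            if ratio(max(f.values()), min(f.values())) > threshold]


def approach_2(frequencies, threshold):
    """Select words whose highest/second-highest ratio is at or above the threshold."""
    selected = []
    for word, f in frequencies.items():
        first, second = sorted(f.values(), reverse=True)[:2]
        if ratio(first, second) >= threshold:
            selected.append(word)
    return selected


print(approach_1(FREQUENCIES, threshold=2.0))
print(approach_2(FREQUENCIES, threshold=3.0))
```

With these assumed thresholds, approach 1 drops only "international", and approach 2 keeps only "politics" and "election", as in the text.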



[0052] Since a word selected in this way can be regarded as having an importance according to
its frequency ratio (or frequency variance), a better document feature vector may be obtained
if, when calculating the feature vector of a document, the feature vectors of the words in that
document are weighted according to this ratio (or variance) before being averaged (for example,
by multiplying each feature vector by log(frequency ratio)).

[0053] In this way, if only the words useful for classification are registered in the dictionary
209 for feature-vector generation, the feature vectors of words are learned once again, and
documents are classified using them, the dictionary for feature-vector generation can be made
smaller, or the precision of classification can be raised.

[0054] One example of claim 3 of this invention is shown in drawing 6. Here, the parts denoted
by reference numerals 301-310 are the same as those denoted by 201-210 in drawing 4,
respectively.
[0055] The document classification equipment consists of a storage section 301 which stores
document data; a document analysis section 302 which analyzes the document data; a word vector
generation section 303 which automatically generates a feature vector expressing the features of
each word using the co-occurrence relations between the words in a document; a word vector
storage section 304 which stores those feature vectors; a document vector generation section 305
which generates the feature vector of a document from the feature vectors of the words contained
in the document; a document vector storage section 306 which stores that feature vector; a
classification section 307 which classifies documents using the similarity between the feature
vectors of documents; a result storage section 308 which stores the classification result; a
dictionary 309 for feature-vector generation in which the words used when generating feature
vectors are registered; a useful-word selection section 310 which selects useful words using the
classification result stored in the result storage section 308 at the time of classification; a
representative vector generation section 311 which obtains the feature vector representing each
category using the classification result stored in the result storage section 308; and a
representative vector storage section 312 which stores the representative vectors generated by
the representative vector generation section 311.

[0056] When the system of claim 3 is built on the example of claim 1, the system has no
useful-word selection section 310.

[0057] Drawing 7 shows the system configuration at learning and classification time. First, the
feature vectors of words are learned and a large amount of document data is classified on that
basis, by the same approach as in the example of claim 1 or the example of claim 2. The
classification result is stored in the result storage section 308, and based on this result the
representative vector generation section 311 generates the representative vectors. A
representative vector can be generated by obtaining the frequency of each word for every
category, selecting the words contained at a high rate only in a certain category, and taking
the average of the feature vectors of those words. As an example, suppose that there are three
categories, a, b, and c, and that three words, "politics", "Parliament", and "international",
are registered in the dictionary 309 for feature-vector generation. Suppose also that the
frequency of each word in every category was as follows.
[0058]

Taxon a: politics 40%, Parliament 50%, international 10%
Taxon b: politics 10%, Parliament 10%, international 80%
Taxon c: politics 20%, Parliament 10%, international 70%

In this case, the representation vector of Taxon a is given as the average of the feature vector
of "politics" and the feature vector of "Parliament". Instead of a plain average, the vectors
may also be weighted by appearance rate. For example, if the frequency of occurrence of
"politics" is twice that of "Parliament", the representation vector of Taxon a is twice the
feature vector of "politics" plus the feature vector of "Parliament", divided by 3.
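The averaging described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the 3-dimensional word feature vectors are made up, and the rule used to decide that a word is "contained at a high rate only in a certain taxon" (its rate is at least `min_ratio` times its rate in every other taxon) is an assumption, since the text gives no threshold.

```python
def weighted_average(vectors, weights):
    """Return the weighted average of equal-length vectors."""
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


def characteristic_words(taxon, freq, min_ratio=2.0):
    """Words whose rate in `taxon` is at least `min_ratio` times their rate in
    every other taxon (min_ratio is an assumed threshold, not from the text)."""
    chosen = []
    for word, rate in freq[taxon].items():
        others = [freq[t][word] for t in freq if t != taxon]
        if all(rate >= min_ratio * other for other in others):
            chosen.append(word)
    return chosen


# Word frequencies per taxon, in percent, from the example in the text.
freq = {
    "a": {"politics": 40, "Parliament": 50, "international": 10},
    "b": {"politics": 10, "Parliament": 10, "international": 80},
    "c": {"politics": 20, "Parliament": 10, "international": 70},
}

# Made-up 3-dimensional word feature vectors, for illustration only.
word_vectors = {
    "politics": [0.9, 0.3, 0.0],
    "Parliament": [0.6, 0.6, 0.0],
    "international": [0.0, 0.3, 0.9],
}

words_a = characteristic_words("a", freq)  # "politics" and "Parliament"
plain = weighted_average([word_vectors[w] for w in words_a], [1] * len(words_a))
# Weights 2:1, following the text's case where "politics" is twice as frequent.
weighted = weighted_average(
    [word_vectors["politics"], word_vectors["Parliament"]], [2, 1]
)
```

With weights 2 and 1 the result is (2 x "politics" + "Parliament") / 3, which is the weighted form given in the text.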

[0059] Alternatively, the average of the feature vectors of the documents classified into
Taxon a may be used as the representation vector of Taxon a.

[0060] Once the representation vectors have been generated and stored in the representation
vector storage section 312, they can be referred to in later classification: a document read
from the document storage section 301 is classified into the taxon whose representation vector
is most similar to the feature vector of that document.

[0061] This invention is not limited to document classification, such as the automatic
classification of electronic mail and electronic news. It can also be used to pick out, from
electronic mail and electronic news, the items likely to interest a user (judged by similarity
with the feature vectors of the mail or news the user has read so far); for ambiguous retrieval
(by retrieving the documents whose feature vector has a similarity at or above a fixed threshold
with the feature vector of the retrieval keyword, related documents can be found even if they do
not match the keyword exactly); for selecting homophones in kana-to-kanji conversion (a
homophone is chosen by similarity with the feature vector obtained from the text converted so
far); for choosing, in speech recognition, handwriting recognition, and the like, the result
that best fits the preceding context (a recognition result is chosen by similarity with the
feature vector obtained from the content recognized so far); and for narrowing the search space
of words during recognition (only the words corresponding to those elements of the feature
vector obtained from the content recognized so far whose value is at or above a fixed threshold
are searched).
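The threshold-based ambiguous retrieval mentioned above might look like the following sketch. It assumes that similarity is the inner product of length-normalized vectors, which is the measure used in the classification example elsewhere in this document; the function names, the vectors, and the threshold value are hypothetical.

```python
import math


def cosine(u, v):
    """Inner product of u and v after normalizing each to length 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def ambiguous_retrieval(keyword_vector, document_vectors, threshold):
    """Return (name, similarity) pairs at or above the threshold, best first."""
    scored = [(name, cosine(keyword_vector, vec))
              for name, vec in document_vectors.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Made-up vectors, for illustration only.
documents = {
    "doc1": [2.0, 2.0, 0.0],
    "doc2": [1.0, 0.0, 0.0],
    "doc3": [0.0, 0.0, 1.0],
}
hits = ambiguous_retrieval([1.0, 1.0, 0.0], documents, threshold=0.7)
```

Here "doc2" is returned even though its vector does not coincide with the keyword vector, because its similarity still clears the threshold.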
[0062]

[Effect] According to this invention, the feature vectors of words can be created automatically
and documents can be classified automatically. Moreover, the feature vectors of words created
by this approach can be used not only for classifying documents but also for ambiguous
retrieval, for selecting homophones in kana-to-kanji conversion, and, in speech recognition,
handwritten character recognition, and the like, for choosing the recognition result that best
fits the preceding context.



TECHNICAL FIELD 



[Industrial Application] This invention relates to document classification equipment used in
automatic document classifiers that store and classify documents automatically, word
processors, filing systems, and the like.
PRIOR ART



[Description of the Prior Art] Conventionally, the automatic classification of documents has
been difficult: either the user classified documents manually, or keywords were extracted from
the documents and classified using a thesaurus created beforehand. Moreover, even systems
described as automatic classification required the basic data for classification, such as basic
examples, to be entered by hand.
EFFECT OF THE INVENTION



[Effect] According to this invention, the feature vectors of words can be created automatically
and documents can be classified automatically. Moreover, the feature vectors of words created
by this approach can be used not only for classifying documents but also for ambiguous
retrieval, for selecting homophones in kana-to-kanji conversion, and, in speech recognition,
handwritten character recognition, and the like, for choosing the recognition result that best
fits the preceding context.



TECHNICAL PROBLEM



[Problem(s) to be Solved by the Invention] However, with such classification the manual work
becomes a bottleneck, and classifying a large number of documents is very difficult.
[0004] This invention was made in view of the above situation, and aims to provide document
classification equipment that classifies documents automatically without manual work.
MEANS



[Means for Solving the Problem] The invention concerning claim 1 is document classification
equipment having: a storage section which stores document data; a document analysis section
which analyzes document data; a word vector generation section which automatically generates a
feature vector expressing the characteristics of each word from the co-occurrence relations
between the words in a document; a word vector storage section which stores those feature
vectors; a document vector generation section which generates the feature vector of a document
from the feature vectors of the words contained in the document; a document vector storage
section which stores the document feature vectors; a classification section which classifies
documents using the similarity between their feature vectors; a result storage section which
stores the classification results; and a dictionary for feature-vector generation in which the
words used when generating feature vectors are registered. It is characterized in that the
feature vector expressing the characteristics of each word is generated automatically from the
co-occurrence relations between the words in a large number of text files, so that documents
can be classified automatically.

[0006] Moreover, the invention concerning claim 2 is further equipped, in addition to the above
configuration, with a useful word election section which selects useful words using the
classification results stored in the result storage section. It is characterized in that, after
a large number of text files have been classified, the frequency of each word is examined for
every resulting taxon, the words useful for classification are selected, and only those words
are used for classification, which raises the precision of classification.

[0007] Moreover, the invention concerning claim 3 is further equipped, in addition to the above
configuration, with a representation vector generation section which obtains a feature vector
representing each taxon from the classification results stored in the result storage section,
and a representation vector storage section which stores the representation vectors generated
by the representation vector generation section. It is characterized in that, after a large
number of text files have been classified, a feature vector representing each field can be
obtained from the words and document feature vectors of every resulting taxon.
OPERATION



[Function] The operation of claim 1 when learning the feature vectors of words is as follows.
The contents of a large number of text files stored in the document storage section are passed
to the document analysis section, where the sentences are analyzed (morphological analysis
etc.). The word vector generation section then analyzes the co-occurrence relations, the
frequencies of occurrence, and so on of the words and generates the feature vector of each
word. The generated word feature vectors are stored in the word vector storage section. This is
how the feature vectors of words are learned. Restricting the words for which feature vectors
are generated to those registered in the dictionary for feature-vector generation keeps the
storage space for feature vectors from becoming too large.

[0009] The operation of claim 1 when classifying documents is as follows. The contents of a
text file stored in the document storage section are passed to the document analysis section,
where the sentences are analyzed (morphological analysis etc.). The document vector generation
section looks up, in the word vector storage section, the feature vectors of the words found by
this analysis, and generates the feature vector of the document from the feature vectors of the
words it contains. The generated document feature vector is stored in the document vector
storage section, and the classification section classifies the documents according to the
similarity between their feature vectors. The classification result is stored in the result
storage section.
[0010] With the configuration of claim 2, after a large number of documents have been
classified, the useful word election section selects useful words using the classification
results stored in the result storage section. Only the selected words are then registered in
the dictionary for feature-vector generation, the feature vectors of words are learned again,
and the documents are classified again using the resulting word feature vectors. Compared with
the configuration of claim 1, this reduces the storage space for feature vectors and can also
raise the precision of classification.

[0011] With the configuration of claim 3, after a large number of documents have been
classified, the representation vector generation section obtains a feature vector representing
each taxon from the classification results stored in the result storage section. The generated
representation vectors are stored in the representation vector storage section. Once the
representation vector of each taxon has been generated, the taxon to which a new document
belongs can be judged simply by comparing the feature vector of that document with the
representation vector of each taxon.
EXAMPLE



[Example] Hereafter, a suitable example of this invention is explained in detail with reference
to the drawings.

[0013] One example of the invention concerning claim 1 is shown in drawing 1. The document
classification equipment consists of the storage section 101, which stores document data; the
document analysis section 102, which analyzes document data; the word vector generation section
103, which automatically generates a feature vector expressing the characteristics of each word
from the co-occurrence relations between the words in a document; the word vector storage
section 104, which stores those feature vectors; the document vector generation section 105,
which generates the feature vector of a document from the feature vectors of the words
contained in the document; the document vector storage section 106, which stores the document
feature vectors; the classification section 107, which classifies documents using the
similarity between their feature vectors; the result storage section 108, which stores the
classification results; and the dictionary 109 for feature-vector generation, in which the
words used when generating feature vectors are registered.

[0014] Because a very large number of words are used in ordinary documents, it is more
realistic to restrict the words used when creating feature vectors. The dictionary 109 for
feature-vector generation serves this purpose: creating feature vectors only from the words
registered in it keeps the storage space for feature vectors from growing enormous.

[0015] Drawing 2 shows the system configuration when learning the feature vectors of words. For
learning, a large amount of document data is stored in the document storage section 101. The
document data is read from the document storage section 101 into the document analysis section
102 in suitable chunks, such as one report, one paragraph, or one sentence, and the document
analysis section 102 analyzes it and extracts the words. Based on the word sequence extracted
here, the word vector generation section 103 generates the feature vectors of the words, and
the generated feature vectors are stored in the word vector storage section 104. This is how
the feature vectors of words are learned.

[0016] Drawing 3 shows the system configuration at the time of document classification. To
classify documents, the data of the documents to be classified is stored in the document
storage section 101. The document data is read from the document storage section 101 into the
document analysis section 102 one classification unit at a time (for example, one report at a
time), and the document analysis section 102 analyzes it and extracts the words. The feature
vectors of the extracted words are looked up in the word vector storage section 104. Usually
two or more words are extracted from one unit of document data (for example, one report), and
the feature vector of the document is calculated by averaging the feature vectors of all the
words found. A better value may be obtained if, instead of a simple average, each feature
vector is weighted according to the inverse of the word's frequency of occurrence before
averaging. For example, the number of reports in which the word appears can be counted over a
large number of reports, and the feature vector of that word multiplied by
log (total number of reports / number of reports in which the word appears).
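The weighted averaging described in [0016] can be sketched as follows. This is an illustrative sketch only: the word feature vectors and report counts are made up, and skipping words that have no stored feature vector is an assumption about how unregistered words are handled.

```python
import math


def document_vector(words, word_vectors, report_counts, total_reports):
    """Average the feature vectors of `words`, each weighted by
    log(total_reports / number of reports containing the word)."""
    dim = len(next(iter(word_vectors.values())))
    acc = [0.0] * dim
    used = 0
    for word in words:
        if word not in word_vectors:
            continue  # assumed: words without a stored vector are ignored
        weight = math.log(total_reports / report_counts[word])
        for i, x in enumerate(word_vectors[word]):
            acc[i] += weight * x
        used += 1
    return [x / used for x in acc] if used else acc


# Made-up 2-dimensional vectors and report counts, for illustration only.
word_vectors = {"arms": [1.0, 0.0], "government": [0.0, 1.0]}
report_counts = {"arms": 10, "government": 500}
vec = document_vector(["arms", "government", "today"], word_vectors,
                      report_counts, total_reports=1000)
```

The rarer word "arms" (10 of 1000 reports) receives weight log(100), while "government" (500 of 1000 reports) receives only log(2), so the rarer word dominates the document vector.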
[0017] Once the feature vectors of documents have been obtained, the documents can be
classified by applying conventional clustering techniques. Documents whose feature vectors are
a short distance apart are simply regarded as belonging to the same field.

[0018] Alternatively, a person can choose a typical document for every taxon, a temporary
representation vector of each taxon can be generated from the feature vectors of the words
extracted from that document, and a document read from the document storage section 101 can be
classified according to which taxon's temporary representation vector its feature vector is
closest to. With this technique as well, if a large amount of document data is read from the
document storage section 101, the effect of errors in the temporary representation vectors
chosen by the person decreases, and in the end a quite general representation vector can be
generated for each field.

[0019] Next, the method of generating the feature vector of a word is explained concretely. The
feature vector of a word is obtained by adding up, over the chunks of document data, the
frequency-of-occurrence distribution of the words contained in each chunk multiplied by the
frequency of occurrence of the word in that chunk. A concrete example follows.
[0020] Example A: "The American government has proposed a radical reexamination of the COCOM
regulations to the main advanced countries."

Example B: "The countries subject to regulation appear to intend to reduce the COCOM regulated
items sharply, on condition that the export of industrial products that lead to the manufacture
of arms is regulated."

The following explains how the feature vectors of words are created from this document data.
Here the document data is read in units of one sentence, but other units, such as one report,
may be used instead.

[0021] Suppose that the feature vector has 21 dimensions (21 words are registered in the
dictionary for feature-vector generation) and that its elements correspond, in order, to the
words "United States, government, advanced, main, country, COCOM, regulation, radical,
reexamination, proposal, object, arms, manufacture, industry, product, export, conditions,
items, large, reduction, intention".
[0022] Under these conditions, when Example A is read from the document storage section 101, it
is analyzed by the document analysis section 102, and "United States, government, advanced,
main, country, COCOM, regulation, radical, reexamination, proposal" are extracted. The word
vector generation section 103 then adds 1 to the elements corresponding to these words in the
feature vectors of all of these words. That is,
(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) is added to the feature vector
of each word that appears in Example A, such as "United States" and "government". This is
illustrated in drawing 8.

[0023] Next, when Example B is read from the document storage section 101, it is analyzed by
the document analysis section 102, and "regulation, object, country, arms, manufacture,
industry, product, export, regulation, conditions, COCOM, regulation, items, large, reduction,
intention" are extracted.

[0024] The word frequency-of-occurrence distribution obtained from this is
(0, 0, 0, 0, 1, 1, 3, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1). Since "regulation" appears
three times, this distribution multiplied by 3, namely
(0, 0, 0, 0, 3, 3, 9, 0, 0, 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), is added to the feature vector
of "regulation". For the words that appear only once in Example B, such as "object" and
"country", (0, 0, 0, 0, 1, 1, 3, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) is added to the
feature vector. This is illustrated in drawing 9.
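The update rule worked through for Examples A and B can be reproduced with a short sketch. The word lists are taken from the text as already-extracted words (the document analysis step is not modeled), the dictionary entries are English glosses of the 21 registered words, and the relative order of "radical" and "reexamination" is an assumption that does not affect the resulting vectors.

```python
from collections import Counter

DICTIONARY = [
    "United States", "government", "advanced", "main", "country", "COCOM",
    "regulation", "radical", "reexamination", "proposal", "object", "arms",
    "manufacture", "industry", "product", "export", "conditions", "items",
    "large", "reduction", "intention",
]

EXAMPLE_A = ["United States", "government", "advanced", "main", "country",
             "COCOM", "regulation", "radical", "reexamination", "proposal"]
EXAMPLE_B = ["regulation", "object", "country", "arms", "manufacture",
             "industry", "product", "export", "regulation", "conditions",
             "COCOM", "regulation", "items", "large", "reduction", "intention"]


def learn_word_vectors(chunks, dictionary):
    """For each chunk, add (count of word) x (frequency distribution of the
    chunk) to the feature vector of every dictionary word in the chunk."""
    index = {word: i for i, word in enumerate(dictionary)}
    vectors = {word: [0] * len(dictionary) for word in dictionary}
    for words in chunks:
        counts = Counter(w for w in words if w in index)
        distribution = [0] * len(dictionary)
        for word, n in counts.items():
            distribution[index[word]] = n
        for word, n in counts.items():
            vec = vectors[word]
            for i, x in enumerate(distribution):
                vec[i] += n * x
    return vectors


vectors = learn_word_vectors([EXAMPLE_A, EXAMPLE_B], DICTIONARY)
```

After both sentences, `vectors["arms"]` equals the distribution of Example B, and `vectors["regulation"]` is the distribution of Example A plus three times that of Example B.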
[0025] With this approach of always adding integers, the magnitude of the added vector changes
with the length of the sentence. It is therefore also possible to normalize the absolute value
of the vector to be added to 1 before adding it, or to normalize the absolute value of the
frequency-of-occurrence distribution vector to 1 and multiply it by a value proportional to the
number of appearances before adding it.
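The two normalization variants mentioned in [0025] might be sketched as follows, reading "absolute value" as Euclidean length. The function names are hypothetical, and using the appearance count itself as the proportional factor is an assumption.

```python
import math


def normalize(vector):
    """Scale a vector so that its absolute value (Euclidean length) is 1."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector] if length else list(vector)


def update_unit(distribution):
    """Variant 1: the vector to be added always has length 1."""
    return normalize(distribution)


def update_scaled(distribution, appearances):
    """Variant 2: unit-length distribution times the number of appearances."""
    return [appearances * x for x in normalize(distribution)]


def length(vector):
    return math.sqrt(sum(x * x for x in vector))


# Frequency-of-occurrence distribution of Example B.
dist_b = [0, 0, 0, 0, 1, 1, 3, 0, 0, 0] + [1] * 11
for_object = update_unit(dist_b)           # "object" appears once
for_regulation = update_scaled(dist_b, 3)  # "regulation" appears three times
```

In both variants the contribution of a sentence no longer grows with the number of words it contains.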
[0026] The feature vector finally obtained is normalized so that its absolute value is 1.
[0027] The word feature vectors obtained in this way are stored in the word vector storage
section 104 and used when classifying documents.
[0028] Next, the generation of a document feature vector at the time of document
classification is explained, taking as an example the case in which the following Example C is
read.

[0029] Example C: "The American government proposed a reduction of arms."

When Example C is read from the document storage section 101, it is analyzed by the document
analysis section 102, and "United States, government, arms, reduction, proposal" are extracted.
The document vector generation section 105 then refers to the contents of the word vector
storage section 104, adds up the feature vectors of the words that appear in Example C, such as
"United States" and "government", and obtains
(3, 3, 3, 3, 5, 5, 9, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2) as the feature vector of
Example C. This is illustrated in drawing 10. Normalization of the vectors is omitted in
drawing 10 for the sake of clarity; in actual processing, the feature vector of each word is
normalized to absolute value 1 before being added. The resulting feature vector is stored in
the document vector storage section 106.
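The vector for Example C can be checked with a short sketch. It assumes the word feature vectors are the unnormalized ones left after learning from Examples A and B only, which is what the figure of (3, 3, 3, 3, 5, 5, 9, ...) implies; the `normalize_first` option corresponds to the remark that actual processing normalizes each word vector before adding.

```python
import math

# Unnormalized word feature vectors after learning from Examples A and B only.
FROM_A = [1] * 10 + [0] * 11
FROM_B = [0, 0, 0, 0, 1, 1, 3, 0, 0, 0] + [1] * 11
word_vectors = {
    "United States": FROM_A, "government": FROM_A, "proposal": FROM_A,
    "arms": FROM_B, "reduction": FROM_B,
}


def normalize(vector):
    """Scale a vector so that its Euclidean length is 1."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector] if length else list(vector)


def document_vector(words, word_vectors, normalize_first=False):
    """Add up the feature vectors of the words found in a document."""
    dim = len(next(iter(word_vectors.values())))
    total = [0] * dim
    for word in words:
        if word not in word_vectors:
            continue  # assumed: words without a stored vector are ignored
        vec = word_vectors[word]
        if normalize_first:
            vec = normalize(vec)
        total = [t + x for t, x in zip(total, vec)]
    return total


EXAMPLE_C = ["United States", "government", "arms", "reduction", "proposal"]
plain = document_vector(EXAMPLE_C, word_vectors)
unit = document_vector(EXAMPLE_C, word_vectors, normalize_first=True)
```

`plain` is three times the Example A vector plus twice the Example B vector, which matches the feature vector given in the text.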

[0030] Next, the use of the document feature vectors in the classification section 107 is
explained. In the simplest case, the absolute value of each document feature vector is first
normalized to 1, and the documents are then classified with an existing technique such as the
K-means method, or according to their similarity (obtained by finding a distance or calculating
an inner product) with the (temporary) representation vector of each taxon. However, feature
vectors obtained by this technique have the property that the elements corresponding to
frequently appearing words become very large, and a better classification result is often
obtained by taking measures so that this property does not harm the classification. For
example, when finding a distance, it is better to use a calculation that does not magnify the
differences between elements: instead of the usual square root of the sum of the squared
differences between elements, use the sum of the absolute values of the differences between
elements. When finding an inner product, it is better to level the values first, for example by
taking the log or a power root of all elements, and then to normalize before calculating.
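The alternatives in [0030] can be compared in a small sketch. The vectors are made up, and `log1p` (log of 1 + x) is used in place of a plain log so that zero elements remain defined; that substitution is an assumption, not something the text specifies.

```python
import math


def euclidean(u, v):
    """Usual distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of absolute differences; large elements are not magnified."""
    return sum(abs(a - b) for a, b in zip(u, v))


def normalize(vector):
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector] if length else list(vector)


def inner_product(u, v):
    """Inner product after normalizing both vectors to length 1."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def level(vector, mode="log", root=2):
    """Level non-negative elements by log(1 + x) or by a power root."""
    if mode == "log":
        return [math.log1p(x) for x in vector]
    return [x ** (1.0 / root) for x in vector]


def leveled_inner_product(u, v, mode="log"):
    return inner_product(level(u, mode), level(v, mode))


# Made-up vectors: the first element of u is dominated by one frequent word.
u = [9.0, 1.0, 1.0]
v = [1.0, 3.0, 1.0]
plain_similarity = inner_product(u, v)
leveled_similarity = leveled_inner_product(u, v)
```

For these two vectors the leveled similarity is higher than the plain one, because the single large element no longer dominates the comparison.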

[0031] As an example of classification, suppose there are three taxa and the representation
vector of each taxon has been obtained as follows.

[0032] The representation vector of taxon 1: (1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1)

The representation vector of taxon 2: (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5)

The representation vector of taxon 3: (4, 4, 4, 4, 6, 6, 6, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

Suppose that the feature vector of a document and the representation vector of a taxon are each
normalized to absolute value 1, that their inner product is calculated, and that the largest
inner product indicates the highest similarity. The normalized feature vector of Example C is

[0033] [Equation 1]

$$\frac{1}{\sqrt{238}}\,(3,3,3,3,5,5,9,3,3,3,2,2,2,2,2,2,2,2,2,2,2)$$

[0034] The normalized representation vector of taxon 1 is

[0035] [Equation 2]

$$\frac{1}{\sqrt{8}}\,(1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1)$$

[0036] The normalized representation vector of taxon 2 is

[0037] [Equation 3]

$$\frac{1}{\sqrt{285}}\,(1,1,1,1,1,1,1,1,1,1,5,5,5,5,5,5,5,5,5,5,5)$$

[0038] The normalized representation vector of taxon 3 is

[0039] [Equation 4]

$$\frac{1}{\sqrt{210}}\,(4,4,4,4,6,6,6,3,3,3,1,1,1,1,1,1,1,1,1,1,1)$$

[0040] Therefore, the inner products of the feature vector of Example C and the representation
vector of each taxon are as follows. The inner product with taxon 1 is

[0041] [Equation 5]

$$\frac{1}{\sqrt{238}} \cdot \frac{1}{\sqrt{8}} \cdot 20 \approx 0.4583$$

[0042] The inner product with taxon 2 is

[0043] [Equation 6]

$$\frac{1}{\sqrt{238}} \cdot \frac{1}{\sqrt{285}} \cdot 150 \approx 0.5759$$

[0044] The inner product with taxon 3 is

[0045] [Equation 7]

$$\frac{1}{\sqrt{238}} \cdot \frac{1}{\sqrt{210}} \cdot 211 \approx 0.9438$$

[0046] Thus the feature vector of Example C is closest to the representation vector of taxon 3,
so Example C is classified into taxon 3. Drawing 11 illustrates this. As in drawing 10,
normalization of the vectors is omitted in drawing 11 for the sake of clarity; in actual
processing, each vector is normalized to absolute value 1 before comparison. The classified
result is stored in the result storage section 108.
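The classification of Example C can be reproduced with the following sketch, which uses the vectors given in the text and takes the inner product of length-normalized vectors as the similarity. The function names are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector so that its Euclidean length is 1."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector] if length else list(vector)


def similarity(u, v):
    """Inner product of the two vectors after normalizing each to length 1."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


REPRESENTATION = {
    1: [1, 1, 1, 1] + [0] * 13 + [1, 1, 1, 1],
    2: [1] * 10 + [5] * 11,
    3: [4, 4, 4, 4, 6, 6, 6, 3, 3, 3] + [1] * 11,
}
EXAMPLE_C = [3, 3, 3, 3, 5, 5, 9, 3, 3, 3] + [2] * 11


def classify(doc_vector, representation_vectors):
    """Return the taxon with the largest similarity, plus all the scores."""
    scores = {taxon: similarity(doc_vector, rep)
              for taxon, rep in representation_vectors.items()}
    best = max(scores, key=scores.get)
    return best, scores


best, scores = classify(EXAMPLE_C, REPRESENTATION)
# best is 3; the scores are about 0.4583, 0.5759, and 0.9438 for taxa 1 to 3.
```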

[0047] Next, one example of claim 2 of this invention is shown in drawing 4. Here, the parts
denoted by signs 201-209 are the same as those denoted by signs 101-109 in drawing 1,
respectively.

[0048] The document classification equipment consists of the storage section 201, which stores
document data; the document analysis section 202, which analyzes document data; the word vector
generation section 203, which automatically generates a feature vector expressing the
characteristics of each word from the co-occurrence relations between the words in a document;
the word vector storage section 204, which stores those feature vectors; the document vector
generation section 205, which generates the feature vector of a document from the feature
vectors of the words contained in the document; the document vector storage section 206, which
stores the document feature vectors; the classification section 207, which classifies documents
using the similarity between their feature vectors; the result storage section 208, which
stores the classification results; the dictionary 209 for feature-vector generation, in which
the words used when generating feature vectors are registered; and the useful word election
section 210, which selects useful words using the classification results stored in the result
storage section 208.
[0049] Drawing 5 shows the system configuration during learning and classification. First, the
feature vectors of words are learned, and a large amount of document data is classified on that
basis by the same approach as in the example of claim 1. The classified result is stored in the
result storage section 208, and the useful word election section 210 selects useful words based
on this result. To do so, it finds the frequency of each word in every taxon and either removes
the words that are contained at about the same rate in every taxon (approach 1: remove the
words whose ratio of highest frequency to lowest frequency is at or below a threshold) or
selects the words that are contained at a high rate only in a certain taxon (approach 2: select
the words whose ratio of highest frequency to second-highest frequency is at or above a
threshold). The words selected by the useful word election section 210 need not come from the
words registered in the dictionary 209 for feature-vector generation; they can be selected from
a wider range of words.
[0050] As an example, suppose there are three taxa, a, b, and c, and that three words,
"politics", "Japan", and "international", are registered in the dictionary 209 for
feature-vector generation. Suppose further that the frequency of each word in every taxon (with
the frequency also examined for "Election" and "problem", in addition to the words registered
in the dictionary 209 for feature-vector generation) is as follows.

[0051]

Taxon a: politics 30%, Japan 5%, international 35%, Election 10%, problem 20%
Taxon b: politics 3%, Japan 55%, international 35%, Election 2%, problem 5%
Taxon c: politics 3%, Japan 30%, international 35%, Election 2%, problem 30%

With approach 1, "international" is contained at the same rate in every taxon, so it is removed
from the dictionary for feature-vector generation. "politics", "Japan", "Election", and
"problem" are biased in frequency across the taxa, so they are selected as useful words and
registered in the dictionary 209 for feature-vector generation (to limit the number of
registered words at this point, only as many words as are to be registered need be taken from
the words with a frequency bias, in descending order of total frequency of occurrence). With
approach 2, "politics" and "Election" are selected and registered in the dictionary 209 for
feature-vector generation, while "Japan", "international", and "problem" are not. As an
intermediate between approach 1 and approach 2, useful words can also be selected according to
whether the ratio of the highest frequency to the n-th highest frequency is at or above a
threshold (n is at least 3 and at most the number of taxa minus one). It is also possible to
select the words with a large variance of frequency rather than a large ratio of frequency.
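Both approaches can be expressed with a single helper that compares the highest per-taxon frequency of a word with its rank-th highest frequency. The sketch below uses the frequencies from the example; the threshold values (2.0 for approach 1, 3.0 for approach 2) are assumptions chosen so that the outcome matches the text, which states no concrete thresholds.

```python
def rank_ratio(word, freq, rank):
    """Ratio of the highest per-taxon frequency of `word` to its rank-th
    highest frequency (rank 2 = second highest, rank = number of taxa = lowest)."""
    rates = sorted((freq[t][word] for t in freq), reverse=True)
    top, other = rates[0], rates[rank - 1]
    if other == 0:
        return float("inf") if top > 0 else 1.0
    return top / other


def select_useful_words(freq, rank, threshold, inclusive=True):
    """Keep the words whose ratio is above the threshold
    (or equal to it, when `inclusive` is true)."""
    words = list(next(iter(freq.values())))
    chosen = []
    for word in words:
        ratio = rank_ratio(word, freq, rank)
        if ratio > threshold or (inclusive and ratio == threshold):
            chosen.append(word)
    return chosen


# Word frequencies per taxon, in percent, from the example in the text.
freq = {
    "a": {"politics": 30, "Japan": 5, "international": 35, "Election": 10, "problem": 20},
    "b": {"politics": 3, "Japan": 55, "international": 35, "Election": 2, "problem": 5},
    "c": {"politics": 3, "Japan": 30, "international": 35, "Election": 2, "problem": 30},
}

# Approach 1: remove words whose highest/lowest ratio is at or below the threshold.
approach_1 = select_useful_words(freq, rank=len(freq), threshold=2.0, inclusive=False)
# Approach 2: keep words whose highest/second-highest ratio is at or above the threshold.
approach_2 = select_useful_words(freq, rank=2, threshold=3.0)
```

With more than three taxa, passing a `rank` between 3 and the number of taxa minus one gives the intermediate approach mentioned in the text.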

[0052] Since a word selected in this way can be considered to have a significance that
corresponds to its ratio (or variance) of frequency, a better document feature vector may be
obtained if, when calculating the feature vector of a document, the feature vectors of the
words in that document are weighted according to this ratio (or variance) before averaging (for
example, each feature vector is multiplied by log (ratio of frequency)).

[0053] If only the words useful for classification are registered in the dictionary 209 for
feature-vector generation in this way, the feature vectors of words are learned once again, and
documents are classified using them, the dictionary for feature-vector generation can be made
smaller and the precision of classification can be raised.

[0054] One example of claim 3 of this invention is shown in drawing 6. Here, the parts denoted
by signs 301-310 are the same as those denoted by signs 201-210 in drawing 4, respectively.
[0055] The document classification equipment consists of the storage section 301, which stores
document data; the document analysis section 302, which analyzes document data; the word vector
generation section 303, which automatically generates a feature vector expressing the
characteristics of each word from the co-occurrence relations between the words in a document;
the word vector storage section 304, which stores those feature vectors; the document vector
generation section 305, which generates the feature vector of a document from the feature
vectors of the words contained in the document; the document vector storage section 306, which
stores the document feature vectors; the classification section 307, which classifies documents
using the similarity between their feature vectors; the result storage section 308, which
stores the classification results; the dictionary 309 for feature-vector generation, in which
the words used when generating feature vectors are registered; the useful word election section
310, which selects useful words using the classification results stored in the result storage
section 308; the representation vector generation section 311, which obtains a feature vector
representing each taxon from the classification results stored in the result storage section
308; and the representation vector storage section 312, which stores the representation vectors
generated by the representation vector generation section 311.



[0056] In addition, when the system of claim 3 is constituted using the example of claim 1, it 
becomes a system without the useful word election section 310. 

[0057] Drawing 7 is a drawing showing the system configuration at the time of learning and 
classification. First, the feature vectors of words are learned, and a large amount of document 
data is classified based on them by the same approach as the example of claim 1 or the example 
of claim 2. The classified result is stored in the result storage section 308, and based on this 
result the representation vector generation section 311 generates the representation vectors. A 
representation vector can be generated by obtaining the frequency of each word for every taxon, 
electing the words that are contained at a high rate only in a certain taxon, and taking the 
average of the feature vectors of those words. As an example, suppose that there are three taxa, 
a, b, and c, and that three words, "politics", "Parliament", and "international", are registered 
in the dictionary 309 for feature-vector generation. Suppose further that the frequency of each 
word for every taxon is as follows. 
[0058] 

Taxon a: politics 40%, Parliament 50%, international 10% 
Taxon b: politics 10%, Parliament 10%, international 80% 
Taxon c: politics 20%, Parliament 10%, international 70% 

In this case, the representation vector of Taxon a is given as the average of the feature vector 
of "politics" and the feature vector of "Parliament". Weighting by the appearance rate, rather 
than taking a mere average, is also conceivable. For example, if the frequency of occurrence of 
"politics" is twice that of "Parliament", the representation vector of Taxon a is obtained by 
adding twice the feature vector of "politics" to the feature vector of "Parliament" and dividing 
the sum by 3. 
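
The generation of a representation vector from the frequencies in this example can be sketched as follows. It is a minimal illustration, not the implementation of the specification: the function names, the 0.3 election threshold, and the two-dimensional word vectors are hypothetical. The threshold captures only "contained at a high rate" and does not check that a word is characteristic of a single taxon.

```python
def elect_words(rates, threshold):
    """Return the words whose rate within one taxon is at least `threshold`."""
    return [word for word, rate in rates.items() if rate >= threshold]


def representation_vector(word_vectors, weights):
    """Weighted average of word feature vectors (weights: word -> weight)."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(next(iter(word_vectors.values())))
    rep = [0.0] * dim
    for word, weight in weights.items():
        for i, x in enumerate(word_vectors[word]):
            rep[i] += weight * x
    return [r / total_weight for r in rep]


# Word frequencies per taxon, taken from the example in the text.
rates = {
    "a": {"politics": 0.40, "Parliament": 0.50, "international": 0.10},
    "b": {"politics": 0.10, "Parliament": 0.10, "international": 0.80},
    "c": {"politics": 0.20, "Parliament": 0.10, "international": 0.70},
}

# Hypothetical feature vectors; real ones would come from the learning step.
word_vectors = {
    "politics": [1.0, 0.0],
    "Parliament": [0.0, 1.0],
    "international": [1.0, 1.0],
}

elected_a = elect_words(rates["a"], threshold=0.3)  # ['politics', 'Parliament']

# Mere average of the elected words: [0.5, 0.5]
plain_a = representation_vector(word_vectors, {w: 1.0 for w in elected_a})

# "politics" counted twice as often as "Parliament": (2 * v_politics + v_Parliament) / 3
weighted_a = representation_vector(word_vectors, {"politics": 2.0, "Parliament": 1.0})
```

Passing equal weights reproduces the mere average, so a single function covers both variants described in the paragraph.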

[0059] Similarly, a method of taking the average of the feature vectors of the documents 
classified into Taxon a as the representation vector of Taxon a is also conceivable. 
[0060] In this way, once the representation vectors have been generated and stored in the 
representation vector storage section 312, they are referred to when future documents are 
classified, so that a document read from the document storage section 301 can be classified into 
the taxon whose representation vector is most similar to the feature vector of that document. 
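
Classification against stored representation vectors can be sketched as follows. The sketch assumes cosine similarity as the similarity measure, which this passage does not specify, and builds each representation vector as the average of the document vectors already classified into a taxon; all names and example vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(doc_vector, representation_vectors):
    """Return the taxon whose representation vector is most similar."""
    return max(
        representation_vectors,
        key=lambda taxon: cosine(doc_vector, representation_vectors[taxon]),
    )


# Hypothetical document vectors that were already classified into taxa.
classified = {
    "a": [[1.0, 0.0], [0.8, 0.2]],
    "b": [[0.0, 1.0], [0.2, 0.8]],
}
representation_vectors = {t: centroid(vs) for t, vs in classified.items()}

new_taxon = classify([0.7, 0.3], representation_vectors)  # 'a'
```

Because only one comparison per taxon is needed, a new document is classified without comparing it with every stored document vector.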

[0061] This invention can be used not only for document classification, such as automatically 
classifying electronic mail and electronic news, but also for the following purposes. It can be 
used to select items likely to interest a user out of electronic mail and electronic news (this 
can be judged by similarity with the feature vectors of the mail or news the user has read so 
far). It can be used for ambiguous retrieval (by retrieving the documents for which the 
similarity between the feature vector of the retrieval keyword and the feature vector of the 
document is at or beyond a fixed threshold, documents can be found through related keywords even 
if they do not match the retrieval keyword exactly). It can be used to select among homophones 
in the conversion of kana into kanji (the homophone is chosen by similarity with the feature 
vector obtained from the contents converted so far). It can be used when the result that best 
suits the past context is chosen in speech recognition, handwriting recognition, etc. (a 
recognition result is chosen by similarity with the feature vector obtained from the contents 
recognized so far). It can also be used to narrow the search space of words etc. at the time of 
recognition (only the words corresponding to those elements of the feature vector obtained from 
the contents recognized so far that are at or beyond a fixed threshold are searched). 
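
The ambiguous retrieval mentioned above can be sketched as follows. The sketch assumes cosine similarity and treats "beyond a threshold" as greater than or equal to it; the function names, document identifiers, vectors, and threshold value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def ambiguous_retrieval(keyword_vector, document_vectors, threshold):
    """Return (document id, similarity) pairs at or above `threshold`,
    ordered from most to least similar."""
    scored = [
        (doc_id, cosine(keyword_vector, vector))
        for doc_id, vector in document_vectors.items()
    ]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical feature vectors of a retrieval keyword and three documents.
keyword_vector = [1.0, 0.0]
document_vectors = {
    "d1": [0.9, 0.1],
    "d2": [0.0, 1.0],
    "d3": [0.6, 0.8],
}

hits = ambiguous_retrieval(keyword_vector, document_vectors, threshold=0.5)
hit_ids = [doc_id for doc_id, _ in hits]  # ['d1', 'd3']
```

Since matching is done on feature vectors rather than on character strings, a document can be returned even when the keyword itself does not appear in it.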






* NOTICES * 



JPO and NCIPI are not responsible for any 
damages caused by the use of this translation. 

1.This document has been translated by computer. So the translation may not reflect the original 
precisely. 

2.**** shows the word which can not be translated. 
3.In the drawings, any words are not translated. 



DESCRIPTION OF DRAWINGS 



[Brief Description of the Drawings] 

[Drawing 1] It is a block diagram showing the basic configuration of one example of the invention 
concerning claim 1. 

[Drawing 2] It is a block diagram showing the system configuration at the time of learning of the 
system shown in drawing 1. 

[Drawing 3] It is a block diagram showing the system configuration at the time of classification 
of the system shown in drawing 1. 

[Drawing 4] It is a block diagram showing the basic configuration of one example of the invention 
concerning claim 2. 

[Drawing 5] It is a block diagram showing the system configuration at the time of learning and 
classification of the system shown in drawing 4. 

[Drawing 6] It is a block diagram showing the basic configuration of one example of the invention 
concerning claim 3. 

[Drawing 7] It is a block diagram showing the system configuration at the time of learning and 
classification of the system shown in drawing 6. 

[Drawing 8] It is a drawing explaining generation of the feature vector of a word. 
[Drawing 9] It is a drawing explaining generation of the feature vector of a word. 
[Drawing 10] It is a drawing explaining generation of the feature vector of a document. 
[Drawing 11] It is a drawing explaining classification of a document. 
[Description of Notations] 

101, 201, 301 Document storage section 

102, 202, 302 Document analysis section 

103, 203, 303 Word vector generation section 

104, 204, 304 Word vector storage section 

105, 205, 305 Document vector generation section 

106, 206, 306 Document vector storage section 

107, 207, 307 Classification section 

108, 208, 308 Result storage section 

109, 209, 309 Dictionary for feature-vector generation 

210, 310 Useful word election section 

311 Representation vector generation section 

312 Representation vector storage section 






